Assessment Task - Predicting Categories of Bank Transaction Data ¶
Assessment Task Summary¶
- In this assessment, we need to build a model that classifies transactions into the right financial categories. We will conduct a detailed analysis of the data, create a system for transforming features (including text and numbers), train various models, and develop a solution that we can explain and improve over time.
Table of Contents¶
- Problem Statement
- Import Required Libraries
- Load the Dataset
- Understanding Data
- Exploratory Data Analysis
- Feature Engineering
- Models Building, Training, and Evaluation
- Models Performance Comparison and Results Interpretation
- Model Explainability and Interpretability using LIME and PDPs
- Prediction Based on New User Data
- Steps to Enhance Model Performance
Problem Statement¶
- MoneyLion wants to help people understand and manage their money better by classifying their bank transactions. Each transaction can fit into categories like “Loans,” “Transfers,” or “Restaurants.” Our aim is to create a system that correctly assigns these categories to new transactions, helping users make better financial choices.
Import Required Libraries¶
# Basic libraries
import re
import string
import warnings
import ssl
import joblib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# NLTK and text processing
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet as wn
# Visualization tools
from wordcloud import WordCloud, STOPWORDS
from sklearn.inspection import PartialDependenceDisplay
# Feature extraction and dimensionality reduction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
# Preprocessing
from sklearn.preprocessing import LabelEncoder, RobustScaler, label_binarize
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# Machine learning models
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
# Clustering
from sklearn.cluster import KMeans
# Metrics and evaluation
from sklearn.metrics import (
roc_curve, auc, classification_report, roc_auc_score, confusion_matrix
)
# Model selection
from sklearn.model_selection import train_test_split
# Lime for explainability
from lime.lime_tabular import LimeTabularExplainer
# Warnings
warnings.filterwarnings('ignore')
# SSL context adjustment for NLTK downloads (if necessary)
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context
# NLTK downloads and setup
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')
# NLTK utilities
STOP_WORDS = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
Load the Dataset¶
bank_transactions = pd.read_csv("bank_transaction.csv")
user_profiles = pd.read_csv("user_profile.csv")
Understanding Data¶
print(f"Number of rows and columns in bank_transactions: {bank_transactions.shape}")
print(f"Number of rows and columns in user_profiles: {user_profiles.shape}")
Number of rows and columns in bank_transactions: (258779, 8) Number of rows and columns in user_profiles: (1000, 7)
# ten random sample records of bank_transactions
bank_transactions.sample(10)
| client_id | bank_id | account_id | txn_id | txn_date | description | amount | category | |
|---|---|---|---|---|---|---|---|---|
| 202957 | 880 | 804 | 925 | 33221 | 2023-09-25 00:00:00 | PURCHASE 0922 ZIP.CO* Maryse Hemant NY 3168316... | -4.700 | Gas Stations |
| 30326 | 315 | 1 | 1 | 91 | 2023-06-22 00:00:00 | THE MYRON STRATT Payroll 230622 968 Meaghan Pr... | 88.936 | Payroll |
| 251965 | 880 | 530 | 608 | 15634 | 2023-09-12 00:00:00 | CHECK111 | -13.104 | Supermarkets and Groceries |
| 175481 | 880 | 644 | 741 | 185616 | 2023-07-03 00:00:00 | KROGER #0 1025 07/01 #Maryse Hemant KROGER #0 ... | -3.932 | Supermarkets and Groceries |
| 120206 | 880 | 619 | 714 | 16864 | 2023-09-21 00:00:00 | Purchase SHELL SERVICE S NORTH BEND WAUS | -1.494 | Convenience Stores |
| 128889 | 880 | 399 | 449 | 141264 | 2023-08-18 08:19:38 | Mars Shave Ice, LLC | -1.430 | Supermarkets and Groceries |
| 62420 | 755 | 1 | 1 | 50 | 2023-09-15 00:00:00 | CASH APP*WIIPE*CASH OUSan FranciscoUS | 15.064 | Third Party |
| 223101 | 880 | 259 | 291 | 119477 | 2023-07-03 00:00:00 | Pos Debit- 9774 9774 Cash App*sendmyshx 103... | -2.000 | Third Party |
| 158582 | 880 | 481 | 547 | 49114 | 2023-07-13 00:00:00 | DEBIT CARD PURCHASE 3168 VENMO* Visa Direct NY | -4.000 | Third Party |
| 237652 | 880 | 788 | 906 | 156641 | 2023-07-02 00:00:00 | ATM Withdrawal ATM MIDTOWN PLAZA 2 4242 M... | -12.000 | ATM |
Observation:
- Negative values in the `amount` column usually mean money is being spent or leaving the account. For example, a transaction at McDonald's categorized as `Restaurants`, or at MEIJER under `Supermarkets and Groceries`, indicates an expense.
- Positive values represent money being added to the account. These could be deposits, refunds, or payroll credits. For example, a VISA Money Transfer Credit categorized as `Payroll` suggests salary or income.
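The sign convention above can be captured as a simple derived feature. A minimal sketch on toy data; the `direction` column name is illustrative, not part of the notebook's dataset:

```python
import pandas as pd

# toy transactions mirroring the sampled amounts above
txns = pd.DataFrame({'amount': [-4.70, 88.936, -13.104, 15.064]})

# hypothetical helper column: positive amounts are inflows, the rest outflows
txns['direction'] = txns['amount'].apply(lambda a: 'inflow' if a > 0 else 'outflow')
```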
# sample ten records of user_profiles
user_profiles.sample(10)
| CLIENT_ID | IS_INTERESTED_INVESTMENT | IS_INTERESTED_BUILD_CREDIT | IS_INTERESTED_INCREASE_INCOME | IS_INTERESTED_PAY_OFF_DEBT | IS_INTERESTED_MANAGE_SPENDING | IS_INTERESTED_GROW_SAVINGS | |
|---|---|---|---|---|---|---|---|
| 410 | 411 | False | False | False | True | False | False |
| 929 | 930 | False | False | False | False | False | False |
| 860 | 861 | False | False | False | False | False | False |
| 32 | 33 | False | False | False | False | False | False |
| 491 | 492 | False | False | False | False | False | False |
| 216 | 217 | False | False | False | True | False | False |
| 159 | 160 | False | False | False | False | False | False |
| 484 | 485 | False | False | False | False | False | False |
| 510 | 511 | False | False | False | False | False | False |
| 184 | 185 | False | False | False | False | False | False |
# descriptive statistics of the amount column in bank_transactions
bank_transactions['amount'].describe()
count 258779.000000 mean 2.544952 std 81.132139 min -9162.460000 25% -6.000000 50% -1.876000 75% 2.000000 max 9397.830000 Name: amount, dtype: float64
Observation:
- The `amount` column has a wide range of values, from negative to positive, with a large standard deviation. This means some very large transactions are mixed with many smaller ones, which could affect model performance. To deal with this, we can transform the data to reduce the impact of extreme values or use outlier-handling techniques.
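One hedged option for handling those extremes is quantile clipping (winsorizing); a minimal sketch, where the 1st/99th percentile bounds are illustrative rather than tuned:

```python
import pandas as pd

# toy amounts with the kind of extremes seen in the describe() output
amounts = pd.Series([-9162.46, -6.0, -1.876, 2.0, 3.5, 9397.83])

# clip to the 1st and 99th percentiles to damp extreme values
lo, hi = amounts.quantile([0.01, 0.99])
clipped = amounts.clip(lower=lo, upper=hi)
```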
Sanity Checks¶
1. Check missing values in both data frames
bank_transactions.isna().sum()
client_id 0 bank_id 0 account_id 0 txn_id 0 txn_date 0 description 0 amount 0 category 257 dtype: int64
Observations:
- There are only 257 missing values in the category column.
bank_transactions[bank_transactions['category'].isna()].sample(2)
| client_id | bank_id | account_id | txn_id | txn_date | description | amount | category | |
|---|---|---|---|---|---|---|---|---|
| 114185 | 880 | 862 | 994 | 124820 | 2023-08-08 19:00:00 | Cash App*Maryse Hemant | -4.20 | NaN |
| 64028 | 788 | 1 | 1 | 94 | 2023-07-30 19:00:00 | Cash app*cash out visa direct caus | 1.55 | NaN |
user_profiles.isna().sum() # no missing values
CLIENT_ID 0 IS_INTERESTED_INVESTMENT 0 IS_INTERESTED_BUILD_CREDIT 0 IS_INTERESTED_INCREASE_INCOME 0 IS_INTERESTED_PAY_OFF_DEBT 0 IS_INTERESTED_MANAGE_SPENDING 0 IS_INTERESTED_GROW_SAVINGS 0 dtype: int64
2. Check consistency of data types across both data frames
print(f"Data-types of bank_transactions:\n{bank_transactions.dtypes}\n\n")
print(f"Data-types of user_profiles:\n{user_profiles.dtypes}")
Data-types of bank_transactions: client_id int64 bank_id int64 account_id int64 txn_id int64 txn_date object description object amount float64 category object dtype: object Data-types of user_profiles: CLIENT_ID int64 IS_INTERESTED_INVESTMENT bool IS_INTERESTED_BUILD_CREDIT bool IS_INTERESTED_INCREASE_INCOME bool IS_INTERESTED_PAY_OFF_DEBT bool IS_INTERESTED_MANAGE_SPENDING bool IS_INTERESTED_GROW_SAVINGS bool dtype: object
Observation:
- The data type of the `txn_date` column is `object` instead of `datetime`.
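The fix is a straightforward `pd.to_datetime` conversion; with `errors='coerce'`, malformed values become `NaT` instead of raising. A minimal sketch on toy data:

```python
import pandas as pd

# toy frame with one valid timestamp and one malformed value
df = pd.DataFrame({'txn_date': ['2023-09-25 00:00:00', 'not-a-date']})

# coerce converts unparseable strings to NaT rather than raising
df['txn_date'] = pd.to_datetime(df['txn_date'], errors='coerce')
print(df['txn_date'].dtype)  # datetime64[ns]
```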
3. Check duplicate values
print(f"Number of duplicates in bank_transactions: {bank_transactions.duplicated().sum()}") # no duplicates
print(f"Number of duplicates in user_profiles: {user_profiles.duplicated().sum()}") # no duplicates
Number of duplicates in bank_transactions: 0 Number of duplicates in user_profiles: 0
4. Check unique values
print(f"Unique values in bank_transactions:\n{bank_transactions.nunique()}\n\n")
print(f"Unique values in user_profiles:\n{user_profiles.nunique()}")
Unique values in bank_transactions: client_id 880 bank_id 990 account_id 1131 txn_id 190505 txn_date 7183 description 102108 amount 29120 category 33 dtype: int64 Unique values in user_profiles: CLIENT_ID 1000 IS_INTERESTED_INVESTMENT 2 IS_INTERESTED_BUILD_CREDIT 2 IS_INTERESTED_INCREASE_INCOME 2 IS_INTERESTED_PAY_OFF_DEBT 2 IS_INTERESTED_MANAGE_SPENDING 2 IS_INTERESTED_GROW_SAVINGS 2 dtype: int64
Observation
- The description column contains 102,108 unique values out of 258,779 records, meaning many descriptions repeat across multiple rows.
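Those repeats can be surfaced with `value_counts`; a minimal sketch on toy descriptions (the strings are illustrative):

```python
import pandas as pd

# toy descriptions with deliberate repeats
desc = pd.Series(['CHECK111', 'VENMO PAYMENT', 'CHECK111', 'CHECK111', 'VENMO PAYMENT'])

# most frequent descriptions first
top = desc.value_counts().head(2)
```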
5. Convert column names into lowercase
user_profiles.columns = user_profiles.columns.str.lower()
Merging Dataframes on client_id Column¶
- Both data frames contain a client_id column, which we can use to merge our data frames into one.
# merge two datasets bank_transactions and user_profiles, on column client_id
merged_df = pd.merge(bank_transactions, user_profiles, how='left', on='client_id')
merged_df.head()
| client_id | bank_id | account_id | txn_id | txn_date | description | amount | category | is_interested_investment | is_interested_build_credit | is_interested_increase_income | is_interested_pay_off_debt | is_interested_manage_spending | is_interested_grow_savings | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 4 | 2023-09-29 00:00:00 | Earnin PAYMENT Donat... | 20.0 | Loans | False | False | False | False | False | False |
| 1 | 1 | 1 | 1 | 3 | 2023-08-14 00:00:00 | ONLINE TRANSFER FROM NDonatas DanyalDA O CARSO... | 25.0 | Transfer Credit | False | False | False | False | False | False |
| 2 | 1 | 1 | 1 | 5 | 2023-09-25 00:00:00 | MONEY TRANSFER AUTHOR... | 20.0 | Loans | False | False | False | False | False | False |
| 3 | 1 | 1 | 2 | 1 | 2023-06-02 00:00:00 | ONLINE TRANSFER FROM CARSON N EVERYDAY CHECKIN... | 16.0 | Transfer Credit | False | False | False | False | False | False |
| 4 | 1 | 1 | 2 | 2 | 2023-06-01 00:00:00 | ONLINE TRANSFER FROM CARSON N EVERYDAY CHECKIN... | 4.0 | Transfer Credit | False | False | False | False | False | False |
Analysis of category column¶
merged_df['category'].value_counts()
category Uncategorized 29392 Third Party 28714 Restaurants 26367 Transfer Credit 21561 Loans 19605 Convenience Stores 18630 Supermarkets and Groceries 16750 Transfer Debit 15114 Gas Stations 12919 Internal Account Transfer 11983 Payroll 8100 Shops 7418 Bank Fees 6432 Transfer 6275 ATM 5672 Transfer Deposit 4976 Digital Entertainment 4525 Utilities 4118 Clothing and Accessories 3190 Department Stores 2002 Insurance 1754 Service 910 Arts and Entertainment 397 Travel 367 Food and Beverage Services 343 Interest 280 Check Deposit 211 Healthcare 207 Telecommunication Services 159 Gyms and Fitness Centers 69 Payment 41 Bank Fee 36 Tax Refund 5 Name: count, dtype: int64
Observation
- The category column is dominated by a few categories like `Uncategorized`, `Third Party`, and `Restaurants`, while many other categories have significantly fewer entries. This class imbalance could hurt the performance of machine learning models.
- To simplify things and balance the categories better, we could merge some of them. Here are a few ideas:
- Bank Fees and Bank Fee: These are similar, so we could just keep one category called "Bank Fees."
- Transfer, Transfer Credit, Transfer Debit, and Transfer Deposit: Since all of these relate to transfers, we could combine them into a single "Transfers" category.
- Food and Beverage Services and Restaurants: While they are different, we could merge these into "Food and Dining" if we don’t need to keep the distinction.
- Digital Entertainment and Arts and Entertainment: These could be combined into an overall "Entertainment" category.
- Convenience Stores and Supermarkets and Groceries: We could group these into a broader "Retail and Groceries" category if we don’t need to differentiate between them.
- Utilities and Telecommunication Services: These could easily be merged into one category called "Services."
- Gyms and Fitness Centers and Healthcare: We could combine these into a broader "Health and Wellness" category.
- Department Stores and Shops: These could be simplified into a single category called "Retail."
- Payment and Check Deposit: We could group these into a "Deposits and Payments" category.
Category Mapping¶
def merge_category(cat):
    if cat in ["Bank Fee", "Bank Fees"]:
        return "Bank Fees"
    elif cat in ["Food and Beverage Services", "Restaurants"]:
        return "Food and Dining"
    elif cat in ["Digital Entertainment", "Arts and Entertainment"]:
        return "Entertainment"
    elif cat in ["Gyms and Fitness Centers", "Healthcare"]:
        return "Health and Wellness"
    else:
        return cat
merged_df['category'] = merged_df['category'].apply(merge_category)
merged_df['category'].value_counts()
category Uncategorized 29392 Third Party 28714 Food and Dining 26710 Transfer Credit 21561 Loans 19605 Convenience Stores 18630 Supermarkets and Groceries 16750 Transfer Debit 15114 Gas Stations 12919 Internal Account Transfer 11983 Payroll 8100 Shops 7418 Bank Fees 6468 Transfer 6275 ATM 5672 Transfer Deposit 4976 Entertainment 4922 Utilities 4118 Clothing and Accessories 3190 Department Stores 2002 Insurance 1754 Service 910 Travel 367 Interest 280 Health and Wellness 276 Check Deposit 211 Telecommunication Services 159 Payment 41 Tax Refund 5 Name: count, dtype: int64
Why Should We Consider “Uncategorized” as Test Data?¶
- I selected the `Uncategorized` category as the test dataset because I will use it to explore how my model handles unknown or unclassified transactions. By using clustering and visualization techniques like K-Means and t-SNE, I can identify potential groupings within these uncategorized transactions that align with known categories.
test_df = merged_df[(merged_df['category']=="Uncategorized") | (merged_df['category'].isna())].copy()  # copy the slice so later column assignments avoid SettingWithCopyWarning
print(f"Number of rows in merged_df before dropping rows with category 'Uncategorized': {merged_df.shape[0]}")
merged_df.drop(test_df.index, inplace=True)
merged_df.reset_index(drop=True, inplace=True) # dropping rows with category 'Uncategorized'
print(f"Number of rows in test_df: {test_df.shape[0]}")
print(f"Number of rows in merged_df after dropping rows with category 'Uncategorized': {merged_df.shape[0]}")
Number of rows in merged_df before dropping rows with category 'Uncategorized': 258779 Number of rows in test_df: 29649 Number of rows in merged_df after dropping rows with category 'Uncategorized': 229130
descriptions = test_df['description'].fillna("")
tfidf = TfidfVectorizer( stop_words='english',max_features=5000)
X_uncat = tfidf.fit_transform(descriptions)
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans_labels = kmeans.fit_predict(X_uncat)
tsne = TSNE(n_components=2, random_state=42, perplexity=30, n_iter=1000)
X_tsne = tsne.fit_transform(X_uncat.toarray())
# X_uncat, kmeans_labels, X_tsne
print(f"X_uncat is a sparse matrix representation of the descriptions: {type(X_uncat)}")
print(f"kmeans_labels is an array of cluster labels: {type(kmeans_labels)}")
print(f"X_tsne is an array of t-SNE coordinates: {type(X_tsne)}")
X_uncat is a sparse matrix representation of the descriptions: <class 'scipy.sparse._csr.csr_matrix'> kmeans_labels is an array of cluster labels: <class 'numpy.ndarray'> X_tsne is an array of t-SNE coordinates: <class 'numpy.ndarray'>
plt.figure(figsize=(10, 7))
scatter = plt.scatter(
X_tsne[:, 0],
X_tsne[:, 1],
c=kmeans_labels,
cmap='viridis',
alpha=0.7
)
plt.colorbar(scatter, label='Cluster Label')
plt.title("t-SNE Clustering of Uncategorized Transactions")
plt.xlabel("t-SNE Dimension 1")
plt.ylabel("t-SNE Dimension 2")
plt.show()
test_df['cluster'] = kmeans_labels
for c_id in sorted(test_df['cluster'].unique()):
    print(f"\nCluster {c_id}:")
    sample_rows = test_df[test_df['cluster'] == c_id].head(5)
    for desc in sample_rows['description']:
        print("  ", desc)
Cluster 0: CHECK111 From Savings - 7762 From Savings - 7762 From Savings - 7762 From Savings - 7762 Cluster 1: Maryse Hemant FROM Maryse Hemant RASOALEJANDRE ON 08/10 REF # BACJFCCGPI30 Maryse Hemant FROM Maryse Hemant Maryse Hemant ON 09/07 REF # BACHVG7CR09W Maryse Hemant FROM Maryse Hemant RASOALEJANDRE ON 08/10 REF # BACLZXBKSJ7G Maryse Hemant FROM Maryse Hemant ON 07/24 REF # BACMBPJUFW5N Maryse Hemant FROM Maryse Hemant ON 07/09 REF # NAV0HW6363TB THANKS Cluster 2: Myra Gosia FROM Myra Gosia ON 07/15 REF # PP0RDWM74Z Myra Gosia FROM Myra Gosia ON 09/08 REF # PP0RK4F4QZ Myra Gosia FROM Myra Gosia ON 07/03 REF # PP0RD3YGVY BILLS Myra Gosia FROM Myra Gosia ON 07/15 REF # PP0RDWM74Z Myra Gosia FROM Myra Gosia ON 07/30 REF # BWS0HWRE20ZB PIZZA Cluster 3: Empower RTP Credit RCVD from Empower Empower RTP CREDIT Empower RTP CREDIT Empower RTP CREDIT Cluster 4: 360 Checking Card Adjustment Signature (Credit) TARGET COM 3600 MN 360 Checking Card Adjustment Signature (Credit) TARGET COM 3600 MN Insta Cash Repayment CHECK CARD REFUND CHECK CARD REFUND
Observations from above Clusters¶
- Cluster 3: This group contains transactions from `Empower`, specifically RTP (Real-Time Payments) credits. The frequent mention of Empower indicates these are likely payments or transfers from the financial service Empower, which could involve loans, repayments, or money transfers, marking this as a finance-related group.
- Clusters 1 and 2: These collect person-to-person transfers (descriptions of the form "name FROM name ... REF # ..."), which plausibly align with existing categories such as `Third Party` or `Transfer Credit`.
- Clusters 0 and 4: Checks and transfers from savings, and card adjustments/refunds respectively, which map naturally onto transfer- and fee-related categories.
# find records in merged_df whose description contains 'Empower'
merged_df[merged_df['description'].str.contains("Empower", na=False)][['description','category']].sample(5)
| description | category | |
|---|---|---|
| 136426 | 3168 Empower Inc 6/23 TheaHall ACH DEBIT | Loans |
| 33115 | Point Of Sale Deposit - Empower Finance, InVis... | Loans |
| 164860 | Empower TRANSFER 3168 | Transfer Credit |
| 22029 | Transfer Empower ; "Empower Cash Advance" | Loans |
| 13669 | Transfer Empower ; "Empower Cash Advance" | Loans |
Exploratory Data Analysis¶
Univariate-Analysis¶
Visualization of category and amount column
sns.set_style("darkgrid")
sns.set_palette("deep")
category_counts = merged_df['category'].value_counts()[0:15]
fig, axes = plt.subplots(1, 2, figsize=(15, 8))
# bar chart of category counts using seaborn
sns.countplot(data=merged_df, x='category', order=category_counts.index, ax=axes[0])
axes[0].set_title('Distribution of Categories')
axes[0].set_xlabel('Category')
axes[0].set_ylabel('Frequency')
axes[0].tick_params(axis='x', rotation=70)
axes[1].pie(category_counts.values, labels=category_counts.index, autopct='%1.1f%%', textprops={'fontsize': 12})
axes[1].set_title('Proportion of Categories')
plt.tight_layout()
plt.show()
Analysis of amount column
print(f"Kurtosis of amount: {merged_df['amount'].kurtosis()}")
print(f"Skewness of amount: {merged_df['amount'].skew()}")
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.histplot(
data=merged_df,
x='amount',
kde=True, # Show KDE (kernel density estimate) curve
color='blue',
ax=axes[0]
)
axes[0].set_title('Distribution of Amount')
sns.boxplot(
data=merged_df,
x='amount',
color='orange',
ax=axes[1]
)
axes[1].set_title('Box Plot of Amount')
plt.tight_layout()
plt.show()
Kurtosis of amount: 2430.822253700687 Skewness of amount: 10.38726423364106
The amount column shows a very uneven distribution:
High Skewness (10.39):
- This number indicates a strong right skew, meaning most values are small, but there are a few very large amounts.
Very High Kurtosis (2430.82):
- This suggests that there are many outliers and that most data points cluster around lower amounts, with a few very high values stretching the tail.
Why It’s Important:
- This imbalance can distort average calculations, making standard statistical measures unreliable. We may need to apply transformations (like taking the log) or deal with outliers to improve the analysis and modeling.
Data Transformation¶
from sklearn.preprocessing import PowerTransformer
sns.set_theme(style="darkgrid", palette="deep")
# log transform (adding 1.5 to avoid log(0))
merged_df['log_amount'] = np.sign(merged_df['amount']) * np.log(np.abs(merged_df['amount']) + 1.5)
print("=== LOG TRANSFORMATION APPLIED ===")
print(f"Kurtosis of log_amount: {merged_df['log_amount'].kurtosis()}")
print(f"Skewness of log_amount: {merged_df['log_amount'].skew()}\n")
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.histplot(
data=merged_df,
x='log_amount',
kde=True,
color='blue',
ax=axes[0]
)
axes[0].set_title('Log-Transformed Amount Distribution')
sns.boxplot(
data=merged_df,
x='log_amount',
color='orange',
ax=axes[1]
)
axes[1].set_title('Log-Transformed Amount Box Plot')
plt.tight_layout()
plt.show()
#--------------------------------------------------------------------------
# YEO-JOHNSON TRANSFORMATION
#--------------------------------------------------------------------------
pt = PowerTransformer(method='yeo-johnson', standardize=True)
merged_df['yj_amount'] = pt.fit_transform(merged_df[['amount']])
print("=== YEO-JOHNSON TRANSFORMATION APPLIED ===")
print(f"Kurtosis of yj_amount: {merged_df['yj_amount'].kurtosis()}")
print(f"Skewness of yj_amount: {merged_df['yj_amount'].skew()}\n")
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.histplot(
data=merged_df,
x='yj_amount',
kde=True,
color='blue',
ax=axes[0]
)
axes[0].set_title('Yeo-Johnson Amount Distribution')
sns.boxplot(
data=merged_df,
x='yj_amount',
color='orange',
ax=axes[1]
)
axes[1].set_title('Yeo-Johnson Amount Box Plot')
plt.tight_layout()
plt.show()
=== LOG TRANSFORMATION APPLIED === Kurtosis of log_amount: -0.15163612062119158 Skewness of log_amount: 0.6870711783624583
=== YEO-JOHNSON TRANSFORMATION APPLIED === Kurtosis of yj_amount: 4388.583833520956 Skewness of yj_amount: -19.531379583528942
Distribution plot of the number of words in each description
def count_words(sentence):
    delimiters = re.escape(string.punctuation)
    # match word-like tokens that are not pure digits, plus punctuation
    result = re.findall(r'\b(?!\d+\b)\w+\b|' + delimiters, sentence)
    result = [s for s in result if not re.match(delimiters, s)]
    return len(result)
merged_df['count'] = merged_df['description'].apply(count_words)
# plot the distribution of the number of words in each description
# (sns.distplot is deprecated; use histplot with a KDE overlay instead)
plt.figure(figsize=(8, 8))
sns.histplot(merged_df['count'], kde=True)
plt.xlim(0, 40)
plt.xlabel('Number of words', fontsize=16)
plt.title('Distribution of the number of words', fontsize=18)
plt.show()
We can see that most descriptions contain between 2 and 10 words.
Bi-variate Analysis and Multi-variate Analysis¶
- How many users are interested in each financial goal?
- How does the average transaction amount vary for users with different interests?
- What is the distribution of transaction amounts across categories?
- How often do users with certain interests (e.g., pay off debt) spend?
- Do any of the interest flags tend to co-occur?
- Which day(s) of the week or time of month have the highest transaction activity?
- How do user interests intersect with transaction categories?
How many users are interested in each financial goal?
- For each column, I can plot how many users have True vs. False. This reveals the overall distribution of interest flags in the user population.
interest_cols = [
'is_interested_investment',
'is_interested_build_credit',
'is_interested_increase_income',
'is_interested_pay_off_debt',
'is_interested_manage_spending',
'is_interested_grow_savings'
]
melted_df = merged_df[interest_cols].reset_index(drop=True)
melted_df = melted_df.melt(var_name='interest_flag', value_name='interest_value')
merged_df.head(1)
| client_id | bank_id | account_id | txn_id | txn_date | description | amount | category | is_interested_investment | is_interested_build_credit | is_interested_increase_income | is_interested_pay_off_debt | is_interested_manage_spending | is_interested_grow_savings | log_amount | yj_amount | count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 4 | 2023-09-29 00:00:00 | Earnin PAYMENT Donat... | 20.0 | Loans | False | False | False | False | False | False | 3.068053 | 0.23842 | 4 |
plt.figure(figsize=(10, 5))
sns.countplot(
data=melted_df,
x='interest_flag',
hue='interest_value' # True/False
)
plt.title("Number of Users Interested (True/False) in Each Financial Goal")
plt.xlabel("Interest Flags")
plt.ylabel("Count of Users")
plt.legend(title="Interest Value")
plt.xticks(rotation=30)
plt.tight_layout()
plt.show()
How does the average transaction amount vary for users with different interests?
- Box plots let us compare how transaction amounts differ between users who have a certain interest (True) and those who do not (False).
plt.figure(figsize=(12, 6))
sns.boxplot(
data=merged_df,
x='is_interested_investment', # True/False on x-axis
y='amount'
)
plt.title("Transaction Amount by Investment Interest (True/False)")
plt.xlabel("Is Interested in Investment?")
plt.ylabel("Transaction Amount")
plt.tight_layout()
plt.show()
What is the distribution of transaction amounts across categories?
- This helps us to see which categories have large or small transaction amounts.
plt.figure(figsize=(15, 8))
sns.boxplot(
data=merged_df,
x='category',
y='amount'
)
plt.title("Distribution of Transaction Amounts by Category")
plt.xlabel("Category")
plt.ylabel("Transaction Amount")
plt.xticks(rotation=75)
plt.tight_layout()
plt.show()
How often do users with certain interests (e.g., pay off debt) spend?
- Compare the proportion of different categories for True vs. False in a pay-off debt interest column.
plt.figure(figsize=(15, 8))
sns.countplot(
data=merged_df,
x='category',
hue='is_interested_pay_off_debt'
)
plt.title("Category Frequency by Pay-Off-Debt Interest (True/False)")
plt.xlabel("Category")
plt.ylabel("Count of Transactions")
plt.xticks(rotation=70)
plt.legend(title="Is Interested in Paying Off Debt?")
plt.tight_layout()
plt.show()
Do any of the interest flags tend to co-occur?
- This indicates if many users wanting to “grow savings” also wish to “manage spending,” helping you understand the relationships among interest flags.
interest_df = merged_df[interest_cols].astype(int) # Convert True/False to 1/0
corr_matrix = interest_df.corr()
plt.figure(figsize=(15, 8))
sns.heatmap(
corr_matrix,
annot=True,
cmap='Blues',
fmt=".2f",
square=True
)
plt.title("Correlation Heatmap of User Interest Flags")
plt.tight_layout()
plt.show()
Which day(s) of the week or time of month have the highest transaction activity?
merged_df['txn_date'] = pd.to_datetime(merged_df['txn_date'], errors='coerce')
merged_df['day_of_week'] = merged_df['txn_date'].dt.day_name() # Monday, Tuesday
plt.figure(figsize=(8, 5))
sns.countplot(
data=merged_df,
x='day_of_week',
order=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
)
plt.title("Transaction Counts by Day of the Week")
plt.xlabel("Day of Week")
plt.ylabel("Count of Transactions")
plt.tight_layout()
plt.show()
# For day of month, if you prefer:
merged_df['day_of_month'] = merged_df['txn_date'].dt.day
plt.figure(figsize=(10, 5))
sns.countplot(data=merged_df, x='day_of_month')
plt.title("Transaction Counts by Day of Month")
plt.xlabel("Day of Month")
plt.ylabel("Count of Transactions")
plt.tight_layout()
plt.show()
How do user interests intersect with transaction categories?
- By creating a cross-tabulation of category versus interest flag counts, we can determine if certain categories are particularly dominant among interested and not-interested user groups.
pivot_data = pd.crosstab(merged_df['category'], merged_df['is_interested_investment'])
# pivot_data has rows as categories and columns as True/False (0/1 if we cast)
plt.figure(figsize=(8, 8))
sns.heatmap(
pivot_data,
annot=True,
fmt='d',
cmap='Blues'
)
plt.title("Cross-Tab Heatmap: Category vs. Investment Interest")
plt.ylabel("Category")
plt.xlabel("Is Interested in Investment (False=0, True=1)")
plt.tight_layout()
plt.show()
WordCloud from the description Column
- A WordCloud helps visualize the most frequent words in the description text.
# Combine all descriptions into one string, handling nulls
text = " ".join(str(desc) for desc in merged_df['description'].dropna())
stopwords = set(STOPWORDS)
wordcloud = WordCloud(
width=800,
height=400,
background_color='white',
stopwords=stopwords,
max_words=200 # limit number of words
).generate(text)
# Display the generated image
plt.figure(figsize=(12,6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off") # turn off axis lines/ticks
plt.title("WordCloud of Transaction Descriptions")
plt.tight_layout()
plt.show()
Feature Engineering¶
1. Extract day, month, and year values
from sklearn.pipeline import FunctionTransformer
def parse_txn_date(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df['txn_date'] = pd.to_datetime(df['txn_date'], errors='coerce')
    df['day_of_week'] = df['txn_date'].dt.dayofweek
    df['day_of_month'] = df['txn_date'].dt.day
    df['month'] = df['txn_date'].dt.month
    df['year'] = df['txn_date'].dt.year
    df.drop(columns=['txn_date'], inplace=True, errors='ignore')
    return df
parse_date_transformer = FunctionTransformer(parse_txn_date, validate=False)
2. Drop unnecessary columns
UNNECESSARY_COLS = [
    'client_id', 'bank_id', 'account_id', 'txn_id',
    'count', 'scaled_data', 'transformed_amount_yj',
    'log_amount', 'yj_amount',
    'year'  # if it’s the same value for all rows
]

def drop_unnecessary_columns(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    for col in UNNECESSARY_COLS:
        if col in df.columns:
            df.drop(columns=[col], inplace=True, errors='ignore')
    return df
drop_cols_transformer = FunctionTransformer(drop_unnecessary_columns, validate=False)
3. Conversion of is_interested columns from boolean to integer values
def convert_booleans(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    interested_cols = [c for c in df.columns if c.startswith('is_interested')]
    for col in interested_cols:
        # Convert True/False to 1/0; missing flags (unmatched users) become 0
        df[col] = df[col].fillna(False).astype(int)
    return df
bool_transformer = FunctionTransformer(convert_booleans, validate=False)
4. Transformation of the amount column using log transformation
def log_transform_amount(df: pd.DataFrame) -> pd.DataFrame:
df = df.copy()
# log transform: sign(amount) * log(abs(amount) + 1.5)
if 'amount' in df.columns:
df['amount'] = np.sign(df['amount']) * np.log(np.abs(df['amount']) + 1.5)
return df
log_transformer = FunctionTransformer(log_transform_amount, validate=False)
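The signed-log transform above is antisymmetric around zero, so debits and credits of equal size map to values of equal magnitude and opposite sign while large amounts are compressed. A minimal illustration on made-up amounts:

```python
import numpy as np

amounts = np.array([-100.0, 0.0, 100.0])

# sign(x) * log(|x| + 1.5): compresses large magnitudes, keeps the debit/credit sign
transformed = np.sign(amounts) * np.log(np.abs(amounts) + 1.5)
```

Zero stays at zero (because `np.sign(0) == 0`), and the -100 and +100 rows land at mirror-image values well below the original scale.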
5. Text vectorization of the description column, and why PCA matters for it
- Remove all stopwords and punctuation marks, and convert the text to lowercase.
- Lemmatize tokens to reduce each word to a meaningful base form.
- Tokenize the corpus and vectorize the words using tfidf.
def clean_sentence(sentence: str) -> str:
    if not isinstance(sentence, str):
        return ""
    # Character class matching any single punctuation mark
    punct = '[' + re.escape(string.punctuation) + ']'
    # \b(?!\d+\b)\w+\b matches word-like tokens that are not pure digits
    result = re.findall(r'\b(?!\d+\b)\w+\b|' + punct, sentence)
    return " ".join(s for s in result if not re.fullmatch(punct, s))
STOP_WORDS = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def remove_stopwords(text: str) -> str:
    tokens = [word.lower() for word in text.split() if word.lower() not in STOP_WORDS]
    return " ".join(tokens)
def get_wordnet_pos(treebank_tag: str) -> str:
if treebank_tag.startswith('J'):
return wn.ADJ
elif treebank_tag.startswith('V'):
return wn.VERB
elif treebank_tag.startswith('R'):
return wn.ADV
else:
return wn.NOUN
def lemmatize_text(text: str) -> str:
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)
lemmatized_tokens = []
for token, pos_ in pos_tags:
wordnet_tag = get_wordnet_pos(pos_)
lemma = lemmatizer.lemmatize(token.lower(), pos=wordnet_tag)
lemmatized_tokens.append(lemma)
return ' '.join(lemmatized_tokens)
def clean_and_lemmatize_description(df: pd.DataFrame) -> pd.DataFrame:
df = df.copy()
if 'description' in df.columns:
        # Normalize whitespace; ' '.join(str(x).split()) also strips leading/trailing spaces
        df['description'] = df['description'].apply(lambda x: ' '.join(str(x).split()))
df['description'] = df['description'].apply(clean_sentence)
df['description'] = df['description'].apply(remove_stopwords)
df['description'] = df['description'].apply(lemmatize_text)
return df
model_data = merged_df.copy()
model_data = clean_and_lemmatize_description(model_data)
tfidf = TfidfVectorizer(stop_words='english', lowercase=False, max_features=1000)
tfidf.fit(model_data['description'])  # fit on the cleaned text, not the raw merged_df
dictionary = tfidf.vocabulary_.items()
df_vector = tfidf.transform(model_data['description']).toarray()
print(f'shape of the vector : {df_vector.shape}')
# use PCA to reduce dimensionality
pca = PCA(random_state=42)
pca.fit(df_vector)
# Explained variance for different number of components
fig, axes = plt.subplots(1, 2, figsize=(20, 5))
# for all components
axes[0].plot(np.cumsum(pca.explained_variance_ratio_))
axes[0].set_title('PCA - cumulative explained variance vs all components')
axes[0].set_xlabel('number of components')
axes[0].set_ylabel('cumulative explained variance')
axes[0].axhline(y=0.8, color='red', linestyle='--')
axes[0].axvline(x=100, color='green', linestyle='--')
# for zoomed to first 100 components
axes[1].plot(np.cumsum(pca.explained_variance_ratio_[:100]))
axes[1].set_title('PCA - cumulative explained variance vs first 100 components')
axes[1].set_xlabel('number of components')
axes[1].set_ylabel('cumulative explained variance')
axes[1].axhline(y=0.8, color='red', linestyle='--')
axes[1].axvline(x=100, color='green', linestyle='--')
plt.tight_layout()
plt.show()
shape of the vector : (229130, 1000)
Note: More than 80% of the variance is explained just by 100 components.
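The same TF-IDF → PCA reduction can be sketched end-to-end on a toy corpus (the descriptions below are invented). PCA needs a dense array, which is why the sparse TF-IDF matrix is converted with `toarray()` above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

corpus = [
    "payment to abc store",
    "atm withdrawal fee",
    "payroll direct deposit",
    "payment to xyz store",
]
tfidf_demo = TfidfVectorizer()
dense = tfidf_demo.fit_transform(corpus).toarray()  # dense matrix for PCA

pca_demo = PCA(n_components=2, random_state=42)
reduced = pca_demo.fit_transform(dense)
```

Four documents over an 11-word vocabulary collapse to a (4, 2) matrix, with the explained-variance ratios summing to at most 1.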
clean_desc_transformer = FunctionTransformer(clean_and_lemmatize_description, validate=False)
Feature Engineering Pipeline¶
from sklearn.compose import ColumnTransformer

def get_feature_pipeline():
    # Pipeline steps:
    # parse date -> drop columns -> boolean conversion -> log transform -> clean description
feature_preprocessing = Pipeline([
('parse_date', parse_date_transformer),
('drop_cols', drop_cols_transformer),
('bool_convert', bool_transformer),
('log_amt', log_transformer),
('clean_desc', clean_desc_transformer)
])
numeric_cols = ['amount', 'day_of_week', 'day_of_month', 'month']
binary_cols = []
text_col = 'description'
numeric_transformer = Pipeline([
('scaler', RobustScaler())
])
text_transformer = Pipeline([
('tfidf', TfidfVectorizer(stop_words='english', lowercase=False, max_features=1000))
])
final_preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_cols),
('text', text_transformer, text_col),
('binary', 'passthrough', binary_cols),
],
remainder='drop'
)
# create a chain final_preprocessor -> PCA, PCA will operate on the combined numeric + TF-IDF matrix
pipeline_full = Pipeline([
('custom_steps', feature_preprocessing),
('final_preprocessor', final_preprocessor),
('pca', PCA(n_components=100, random_state=42))
])
return pipeline_full
le = LabelEncoder()
merged_df['category'] = le.fit_transform(merged_df['category'])
X = merged_df.drop(columns=['category'], errors='ignore')
y = merged_df['category'].copy()
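A small reminder of how LabelEncoder behaves: classes are sorted alphabetically before integer codes are assigned, and `inverse_transform` recovers the original labels (we rely on this at prediction time later):

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
labels = ["Loans", "Transfers", "Restaurants", "Loans"]
codes = le_demo.fit_transform(labels)

# classes_ are stored sorted alphabetically; inverse_transform maps codes back
restored = le_demo.inverse_transform(codes)
```

Here `classes_` is `['Loans', 'Restaurants', 'Transfers']`, so the codes come out as `[0, 2, 1, 0]`.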
Train Test Split
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.3,
stratify=y,
random_state=42
)
Build and Fit the Pipeline
# create instance of the pipeline
feature_pipeline = get_feature_pipeline()
# fit the pipeline
feature_pipeline.fit(X_train)
# transform both train and test sets
X_train_transformed = feature_pipeline.transform(X_train)
X_test_transformed = feature_pipeline.transform(X_test)
test_df.drop(columns=['category'], inplace=True, errors='ignore')
test_df_transformed = feature_pipeline.transform(test_df)
print("Train shape after pipeline:", X_train_transformed.shape)
print("Test shape after pipeline:", X_test_transformed.shape)
print("Test shape after pipeline:", test_df_transformed.shape)
Train shape after pipeline: (160391, 100) Test shape after pipeline: (68739, 100) Test shape after pipeline: (29649, 100)
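The key discipline here is that the pipeline's statistics (scaler medians and IQRs, the TF-IDF vocabulary, the PCA components) are learned on the training split only and merely reused on the other splits. A minimal sketch of this fit/transform separation, using just a RobustScaler on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X_tr = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_te = np.array([[3.0], [100.0]])

scaler = RobustScaler()
scaler.fit(X_tr)               # median (3.0) and IQR (2.0) come from training data only
Z_te = scaler.transform(X_te)  # test rows reuse those statistics: (x - 3) / 2
```

The test value 3.0 lands exactly on the training median (scaled to 0), and the outlier 100.0 is scaled with the training IQR rather than its own split's statistics.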
Save Feature Engineering Pipeline
# Save Feature Engineering Pipeline
joblib.dump(feature_pipeline, 'models/feature_engineering_pipeline.pkl')
['models/feature_engineering_pipeline.pkl']
Helper Function to Evaluate Model Performance¶
from sklearn.metrics import accuracy_score

def evaluate_model(model, X, y):
    y_pred = model.predict(X)
    y_proba = model.predict_proba(X)
    try:
        roc_auc = roc_auc_score(y, y_proba, multi_class='ovr', average='macro')
    except ValueError:
        # fall back gracefully if y contains only one class
        roc_auc = float('nan')
    acc = accuracy_score(y, y_pred)
    cr = classification_report(y, y_pred)
    cm = confusion_matrix(y, y_pred)
return {
'accuracy': acc,
'roc_auc_ovr': roc_auc,
'classification_report': cr,
'confusion_matrix': cm,
'y_pred': y_pred
}
Helper Function to Plot the Multi-class ROC Curve¶
def plot_roc_curve_multi_class(model, X, y, ax=None, title="ROC Curve"):
y_score = model.predict_proba(X)
classes = np.unique(y)
n_classes = len(classes)
y_bin = label_binarize(y, classes=classes)
fpr = {}
tpr = {}
roc_auc = {}
for i in range(n_classes):
fpr[i], tpr[i], _ = roc_curve(y_bin[:, i], y_score[:, i])
roc_auc[i] = auc(fpr[i], tpr[i])
# for micro-average
fpr["micro"], tpr["micro"], _ = roc_curve(y_bin.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
# for macro-average
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
mean_tpr += np.interp(all_fpr, fpr[i], tpr[i])
mean_tpr /= n_classes
fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])
if ax is None:
fig, ax = plt.subplots()
ax.plot(fpr["micro"], tpr["micro"],
label='micro-average ROC (area = {0:0.2f})'
''.format(roc_auc["micro"]),
color='deeppink', linestyle=':', linewidth=4)
ax.plot(fpr["macro"], tpr["macro"],
label='macro-average ROC (area = {0:0.2f})'
''.format(roc_auc["macro"]),
color='navy', linestyle=':', linewidth=4)
# Plot ROC curve for each class
for i in range(n_classes):
ax.plot(fpr[i], tpr[i], lw=2, label='Class {0} (area = {1:0.2f})'
''.format(classes[i], roc_auc[i]))
ax.plot([0, 1], [0, 1], 'k--', lw=2)
ax.set_xlim([0.0, 1.0])
ax.set_ylim([0.0, 1.05])
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title(title)
ax.legend(loc="lower right")
2. Train Multiple Models¶
models = {}
results = {}
1. Random Forest¶
model_name = "Random Forest"
print(f"Training Random Forest...")
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_transformed, y_train)
models['RandomForest'] = rf
print(f"Random Forest Done")
Training Random Forest...
Random Forest Done
print(f"\n=== Evaluating {model_name} on Train Set ===")
train_eval = evaluate_model(rf, X_train_transformed, y_train)
print("Accuracy:", train_eval['accuracy'])
print("ROC AUC (macro):", train_eval['roc_auc_ovr'])
print("Classification Report:\n", train_eval['classification_report'])
print(f"\n=== Evaluating {model_name} on Test Set ===")
test_eval = evaluate_model(rf, X_test_transformed, y_test)
print("Accuracy:", test_eval['accuracy'])
print("ROC AUC (macro):", test_eval['roc_auc_ovr'])
print("Classification Report:\n", test_eval['classification_report'])
# save the model to the models/ folder
joblib.dump(rf, f"models/{model_name}_model.pkl")
if test_df_transformed is not None and test_df_transformed.shape[0] > 0:
print(f"\n=== {model_name} Predictions on test_df_transformed ===")
pred_uncat = rf.predict(test_df_transformed)
print("Predicted categories (first 10):", pred_uncat[:10])
results[model_name] = {
'train_eval': train_eval,
'test_eval': test_eval
}
=== Evaluating Random Forest on Train Set ===
Accuracy: 0.9976120854661421
ROC AUC (macro): 0.9999945187388483
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 3970
1 1.00 1.00 1.00 4528
2 1.00 1.00 1.00 148
3 1.00 1.00 1.00 2233
4 1.00 1.00 1.00 13041
5 0.98 0.97 0.98 1401
6 1.00 0.99 1.00 3445
7 0.99 1.00 0.99 18697
8 1.00 1.00 1.00 9043
9 0.99 0.97 0.98 193
10 1.00 1.00 1.00 1228
11 1.00 1.00 1.00 196
12 1.00 1.00 1.00 8388
13 1.00 1.00 1.00 13724
14 1.00 1.00 1.00 29
15 1.00 1.00 1.00 5670
16 0.99 0.99 0.99 637
17 1.00 1.00 1.00 5193
18 1.00 1.00 1.00 11725
19 1.00 1.00 1.00 3
20 1.00 0.99 1.00 111
21 1.00 1.00 1.00 20100
22 1.00 1.00 1.00 4392
23 1.00 1.00 1.00 15093
24 1.00 1.00 1.00 10580
25 1.00 1.00 1.00 3483
26 1.00 0.99 0.99 257
27 0.99 0.98 0.98 2883
accuracy 1.00 160391
macro avg 1.00 1.00 1.00 160391
weighted avg 1.00 1.00 1.00 160391
=== Evaluating Random Forest on Test Set ===
Accuracy: 0.9046247399583933
ROC AUC (macro): 0.9555875199392778
Classification Report:
precision recall f1-score support
0 0.99 0.99 0.99 1702
1 0.99 0.99 0.99 1940
2 1.00 0.92 0.96 63
3 0.74 0.58 0.65 957
4 0.79 0.87 0.83 5589
5 0.67 0.45 0.54 601
6 0.89 0.88 0.89 1477
7 0.84 0.85 0.85 8013
8 0.75 0.77 0.76 3876
9 0.64 0.43 0.52 83
10 0.79 0.71 0.75 526
11 0.92 0.98 0.95 84
12 1.00 0.99 1.00 3595
13 0.95 0.96 0.95 5881
14 0.79 0.92 0.85 12
15 0.94 0.96 0.95 2430
16 0.70 0.47 0.57 273
17 0.88 0.79 0.83 2225
18 0.84 0.83 0.83 5025
19 0.00 0.00 0.00 2
20 0.72 0.58 0.64 48
21 0.98 0.98 0.98 8614
22 0.95 0.92 0.94 1883
23 0.98 0.99 0.99 6468
24 0.99 0.99 0.99 4534
25 0.96 0.97 0.96 1493
26 0.84 0.75 0.79 110
27 0.82 0.78 0.80 1235
accuracy 0.90 68739
macro avg 0.83 0.80 0.81 68739
weighted avg 0.90 0.90 0.90 68739
=== Random Forest Predictions on test_df_transformed ===
Predicted categories (first 10): [18 18 13 21 13 13 13 13 13 13]
# plot roc for both training and testing
fig, axes = plt.subplots(1, 2, figsize=(15, 8))
plot_roc_curve_multi_class(rf, X_train_transformed, y_train, ax=axes[0], title=f"{model_name} - Train ROC")
plot_roc_curve_multi_class(rf, X_test_transformed, y_test, ax=axes[1], title=f"{model_name} - Test ROC")
plt.tight_layout()
plt.show()
2. Logistic Regression¶
print(f"Training Logistic Regression...")
lr= LogisticRegression(multi_class='multinomial', solver='lbfgs')
lr.fit(X_train_transformed, y_train)
models['LogisticRegression'] = lr
print(f"Logistic Regression Done")
Training Logistic Regression...
Logistic Regression Done
model_name = "Logistic Regression"
print(f"\n=== Evaluating {model_name} on Train Set ===")
train_eval = evaluate_model(lr, X_train_transformed, y_train)
print("Accuracy:", train_eval['accuracy'])
print("ROC AUC (macro):", train_eval['roc_auc_ovr'])
print("Classification Report:\n", train_eval['classification_report'])
print(f"\n=== Evaluating {model_name} on Test Set ===")
test_eval = evaluate_model(lr, X_test_transformed, y_test)
print("Accuracy:", test_eval['accuracy'])
print("ROC AUC (macro):", test_eval['roc_auc_ovr'])
print("Classification Report:\n", test_eval['classification_report'])
# save the model to the models/ folder
joblib.dump(lr, f"models/{model_name}_model.pkl")
if test_df_transformed is not None and test_df_transformed.shape[0] > 0:
print(f"\n=== {model_name} Predictions on test_df_transformed ===")
pred_uncat = lr.predict(test_df_transformed)
print("Predicted categories (first 10):", pred_uncat[:10])
results[model_name] = {
'train_eval': train_eval,
'test_eval': test_eval
}
=== Evaluating Logistic Regression on Train Set ===
Accuracy: 0.765342195010942
ROC AUC (macro): 0.9560158440257897
Classification Report:
precision recall f1-score support
0 0.95 0.99 0.97 3970
1 0.95 0.94 0.94 4528
2 0.76 0.74 0.75 148
3 0.61 0.32 0.42 2233
4 0.55 0.67 0.60 13041
5 0.59 0.10 0.17 1401
6 0.71 0.53 0.61 3445
7 0.52 0.75 0.62 18697
8 0.44 0.38 0.41 9043
9 0.00 0.00 0.00 193
10 0.42 0.11 0.17 1228
11 0.86 0.62 0.72 196
12 0.95 0.92 0.93 8388
13 0.85 0.82 0.83 13724
14 0.00 0.00 0.00 29
15 0.85 0.89 0.87 5670
16 0.00 0.00 0.00 637
17 0.87 0.59 0.71 5193
18 0.73 0.59 0.65 11725
19 0.00 0.00 0.00 3
20 0.80 0.14 0.24 111
21 0.93 0.94 0.94 20100
22 0.78 0.69 0.73 4392
23 0.93 0.96 0.94 15093
24 0.93 0.96 0.94 10580
25 0.90 0.86 0.88 3483
26 1.00 0.03 0.06 257
27 0.76 0.63 0.69 2883
accuracy 0.77 160391
macro avg 0.67 0.54 0.56 160391
weighted avg 0.77 0.77 0.76 160391
=== Evaluating Logistic Regression on Test Set ===
Accuracy: 0.7650678654039191
ROC AUC (macro): 0.9532511274695313
Classification Report:
precision recall f1-score support
0 0.95 0.99 0.97 1702
1 0.94 0.94 0.94 1940
2 0.83 0.70 0.76 63
3 0.62 0.29 0.40 957
4 0.56 0.68 0.61 5589
5 0.61 0.09 0.16 601
6 0.70 0.54 0.61 1477
7 0.52 0.75 0.62 8013
8 0.44 0.40 0.42 3876
9 0.00 0.00 0.00 83
10 0.37 0.09 0.15 526
11 0.85 0.52 0.65 84
12 0.94 0.92 0.93 3595
13 0.85 0.82 0.83 5881
14 0.00 0.00 0.00 12
15 0.85 0.89 0.87 2430
16 0.00 0.00 0.00 273
17 0.88 0.57 0.69 2225
18 0.72 0.59 0.65 5025
19 0.00 0.00 0.00 2
20 1.00 0.10 0.19 48
21 0.93 0.94 0.94 8614
22 0.79 0.69 0.74 1883
23 0.93 0.96 0.94 6468
24 0.93 0.96 0.94 4534
25 0.91 0.87 0.89 1493
26 1.00 0.02 0.04 110
27 0.73 0.62 0.67 1235
accuracy 0.77 68739
macro avg 0.67 0.53 0.56 68739
weighted avg 0.77 0.77 0.76 68739
=== Logistic Regression Predictions on test_df_transformed ===
Predicted categories (first 10): [18 18 13 23 4 4 4 4 21 13]
# plot roc for both training and testing
fig, axes = plt.subplots(1, 2, figsize=(15, 8))
plot_roc_curve_multi_class(lr, X_train_transformed, y_train, ax=axes[0], title=f"{model_name} - Train ROC")
plot_roc_curve_multi_class(lr, X_test_transformed, y_test, ax=axes[1], title=f"{model_name} - Test ROC")
plt.tight_layout()
plt.show()
3. GaussianNB (Naive Bayes)¶
print(f"Training GaussianNB...")
mnb = GaussianNB()
mnb.fit(X_train_transformed, y_train)
models['GaussianNB'] = mnb
print(f"GaussianNB Done")
Training GaussianNB...
GaussianNB Done
model_name = "GaussianNB"
print(f"\n=== Evaluating {model_name} on Train Set ===")
train_eval = evaluate_model(mnb, X_train_transformed, y_train)
print("Accuracy:", train_eval['accuracy'])
print("ROC AUC (macro):", train_eval['roc_auc_ovr'])
print("Classification Report:\n", train_eval['classification_report'])
print(f"\n=== Evaluating {model_name} on Test Set ===")
test_eval = evaluate_model(mnb, X_test_transformed, y_test)
print("Accuracy:", test_eval['accuracy'])
print("ROC AUC (macro):", test_eval['roc_auc_ovr'])
print("Classification Report:\n", test_eval['classification_report'])
# save the model to the models/ folder
joblib.dump(mnb, f"models/{model_name}_model.pkl")
if test_df_transformed is not None and test_df_transformed.shape[0] > 0:
print(f"\n=== {model_name} Predictions on test_df_transformed ===")
pred_uncat = mnb.predict(test_df_transformed)
print("Predicted categories (first 10):", pred_uncat[:10])
results[model_name] = {
'train_eval': train_eval,
'test_eval': test_eval
}
=== Evaluating GaussianNB on Train Set ===
Accuracy: 0.5495071419219283
ROC AUC (macro): 0.9196774599226875
Classification Report:
precision recall f1-score support
0 0.95 0.83 0.89 3970
1 0.89 0.90 0.89 4528
2 0.77 0.87 0.82 148
3 0.26 0.38 0.31 2233
4 0.65 0.36 0.46 13041
5 0.14 0.49 0.22 1401
6 0.47 0.59 0.52 3445
7 0.73 0.38 0.50 18697
8 0.57 0.27 0.36 9043
9 0.04 0.20 0.07 193
10 0.13 0.19 0.15 1228
11 0.89 0.61 0.72 196
12 0.85 0.81 0.83 8388
13 0.68 0.63 0.65 13724
14 0.02 0.55 0.03 29
15 0.79 0.69 0.74 5670
16 0.01 0.33 0.02 637
17 0.86 0.54 0.66 5193
18 0.61 0.27 0.38 11725
19 0.06 1.00 0.11 3
20 0.06 0.70 0.10 111
21 0.78 0.58 0.67 20100
22 0.43 0.51 0.47 4392
23 0.83 0.67 0.74 15093
24 0.74 0.74 0.74 10580
25 0.81 0.76 0.79 3483
26 0.04 0.25 0.07 257
27 0.17 0.75 0.27 2883
accuracy 0.55 160391
macro avg 0.51 0.57 0.47 160391
weighted avg 0.70 0.55 0.60 160391
=== Evaluating GaussianNB on Test Set ===
Accuracy: 0.5500952879733485
ROC AUC (macro): 0.8995635053938663
Classification Report:
precision recall f1-score support
0 0.95 0.82 0.88 1702
1 0.88 0.90 0.89 1940
2 0.76 0.86 0.81 63
3 0.24 0.34 0.28 957
4 0.67 0.37 0.47 5589
5 0.15 0.51 0.23 601
6 0.47 0.60 0.52 1477
7 0.73 0.38 0.50 8013
8 0.58 0.28 0.37 3876
9 0.04 0.22 0.07 83
10 0.11 0.18 0.14 526
11 0.94 0.54 0.68 84
12 0.86 0.82 0.84 3595
13 0.67 0.63 0.65 5881
14 0.02 0.58 0.03 12
15 0.78 0.70 0.74 2430
16 0.01 0.31 0.02 273
17 0.87 0.52 0.65 2225
18 0.60 0.28 0.38 5025
19 0.00 0.00 0.00 2
20 0.05 0.54 0.08 48
21 0.79 0.59 0.67 8614
22 0.45 0.53 0.49 1883
23 0.84 0.66 0.74 6468
24 0.73 0.75 0.74 4534
25 0.83 0.74 0.78 1493
26 0.03 0.19 0.06 110
27 0.17 0.74 0.27 1235
accuracy 0.55 68739
macro avg 0.51 0.52 0.46 68739
weighted avg 0.70 0.55 0.60 68739
=== GaussianNB Predictions on test_df_transformed ===
Predicted categories (first 10): [13 13 13 16 3 3 3 3 15 15]
# plot roc for both training and testing
fig, axes = plt.subplots(1, 2, figsize=(15, 8))
plot_roc_curve_multi_class(mnb, X_train_transformed, y_train, ax=axes[0], title=f"{model_name} - Train ROC")
plot_roc_curve_multi_class(mnb, X_test_transformed, y_test, ax=axes[1], title=f"{model_name} - Test ROC")
plt.tight_layout()
plt.show()
Models Performance Comparison and Results Interpretation¶
model_performance = {}
for model_name, model_data in results.items():
test_eval = model_data['test_eval']
y_pred = test_eval['y_pred']
y_true = y_test
# micro-averaged precision, recall, and F1 from the confusion matrix
conf_matrix = confusion_matrix(y_true, y_pred)
tp = np.diag(conf_matrix)
fn = conf_matrix.sum(axis=1) - tp
fp = conf_matrix.sum(axis=0) - tp
tn = conf_matrix.sum() - (tp + fn + fp)
precision = tp.sum() / (tp.sum() + fp.sum()) if (tp.sum() + fp.sum()) > 0 else 0
recall = tp.sum() / (tp.sum() + fn.sum()) if (tp.sum() + fn.sum()) > 0 else 0
f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
roc_auc = test_eval['roc_auc_ovr']
model_performance[model_name] = {
'precision': precision,
'recall': recall,
'f1_score': f1_score,
'roc_auc_ovr': roc_auc
}
# show the performance of every model
for model_name, metrics in model_performance.items():
print(f"\nPerformance for {model_name}:")
for metric_name, value in metrics.items():
print(f"{metric_name}: {value:.4f}")
Performance for Random Forest:
precision: 0.9046
recall: 0.9046
f1_score: 0.9046
roc_auc_ovr: 0.9556

Performance for Logistic Regression:
precision: 0.7651
recall: 0.7651
f1_score: 0.7651
roc_auc_ovr: 0.9533

Performance for GaussianNB:
precision: 0.5501
recall: 0.5501
f1_score: 0.5501
roc_auc_ovr: 0.8996
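Note that precision, recall, and F1 computed this way are micro-averages, and in multi-class classification micro-averaged precision and recall both reduce to overall accuracy: every misclassified sample counts as exactly one false positive (for the predicted class) and one false negative (for the true class), so the denominators coincide. That is why the three numbers match for each model above. A small demonstration:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]  # two mistakes out of six

cm = confusion_matrix(y_true, y_pred)
tp = np.diag(cm)
fp = cm.sum(axis=0) - tp
fn = cm.sum(axis=1) - tp

micro_precision = tp.sum() / (tp.sum() + fp.sum())
micro_recall = tp.sum() / (tp.sum() + fn.sum())
```

Both micro averages come out to 4/6, exactly the accuracy.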
Models Performance Comparison Plot on Test data¶
model_names = list(results.keys())
metrics = ['accuracy', 'roc_auc_ovr', 'precision', 'recall', 'f1_score']
metric_values = {metric: [] for metric in metrics}
for model in model_names:
test_eval = results[model]['test_eval']
metric_values['accuracy'].append(test_eval['accuracy'])
metric_values['roc_auc_ovr'].append(test_eval['roc_auc_ovr'])
conf_matrix = test_eval['confusion_matrix']
# micro-averaged precision, recall, and F1 from the confusion matrix
tp = np.diag(conf_matrix)
fn = conf_matrix.sum(axis=1) - tp
fp = conf_matrix.sum(axis=0) - tp
tn = conf_matrix.sum() - (tp + fn + fp)
precision = tp.sum() / (tp.sum() + fp.sum())
recall = tp.sum() / (tp.sum() + fn.sum())
f1_score = 2 * (precision * recall) / (precision + recall)
metric_values['precision'].append(precision)
metric_values['recall'].append(recall)
metric_values['f1_score'].append(f1_score)
metric_values
{'accuracy': [0.9046247399583933, 0.7650678654039191, 0.5500952879733485],
'roc_auc_ovr': [0.9555875199392778, 0.9532511274695313, 0.8995635053938663],
'precision': [0.9046247399583933, 0.7650678654039191, 0.5500952879733485],
'recall': [0.9046247399583933, 0.7650678654039191, 0.5500952879733485],
'f1_score': [0.9046247399583932, 0.7650678654039191, 0.5500952879733485]}
palette = sns.color_palette("husl", len(model_names))
fig, ax = plt.subplots(figsize=(14, 8))
bar_width = 0.15
x_indexes = np.arange(len(metrics))
for i, model in enumerate(model_names):
metric_values_for_model = [metric_values[metric][i] for metric in metrics]
ax.bar(
x_indexes + i * bar_width,
metric_values_for_model,
width=bar_width,
label=model,
color=palette[i],
edgecolor='black',
alpha=0.9
)
ax.set_xlabel("Metrics", fontsize=14, labelpad=10)
ax.set_ylabel("Values", fontsize=14, labelpad=10)
ax.set_title("Model Performance Comparison", fontsize=18, weight='bold', pad=20)
ax.set_xticks(x_indexes + bar_width)
ax.set_xticklabels(metrics, fontsize=12, weight='bold')
ax.legend(fontsize=12, title="Models", loc='upper left', bbox_to_anchor=(1, 1))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# add values on top of the bars
for i, model in enumerate(model_names):
for j, metric in enumerate(metrics):
ax.text(
x_indexes[j] + i * bar_width,
metric_values[metric][i] + 0.01,
f"{metric_values[metric][i]:.2f}",
ha='center',
fontsize=10,
color='black',
weight='bold'
)
# background color for customizing the plot
fig.patch.set_facecolor('#f7f7f7')
ax.set_facecolor('#f7f7f7')
plt.tight_layout()
plt.show()
Interpretation of Results¶
The Random Forest model is clearly the best performer among the three models based on the provided metrics:
- Performance Metrics Comparison
| Model               | Precision | Recall | F1-Score | ROC-AUC-OVR |
|---------------------|-----------|--------|----------|-------------|
| Random Forest       | 0.9046    | 0.9046 | 0.9046   | 0.9556      |
| Logistic Regression | 0.7651    | 0.7651 | 0.7651   | 0.9533      |
| GaussianNB          | 0.5501    | 0.5501 | 0.5501   | 0.8996      |
- Key Observations
- Random Forest:
- Achieves the highest precision, recall, F1-score, and ROC-AUC across all models.
- It handles imbalanced data better due to its ability to learn complex relationships and decision boundaries. This makes it effective for both majority and minority classes.
Why it’s best: The model balances performance across all metrics, making it reliable for both accurate and balanced predictions.
- Logistic Regression:
- Performs reasonably well, especially in terms of ROC-AUC (0.9533), which is close to Random Forest.
- However, its precision, recall, and F1-score are significantly lower, likely due to its inability to model complex, non-linear relationships inherent in the dataset.
- GaussianNB:
- Performs the worst across all metrics.
- Naive Bayes assumes feature independence, which is likely violated in this dataset where amount, description, and other features interact in complex ways.
Model Explainability and Interpretability using LIME and PDPs¶
Why We Use These:
- LIME: to explain why the model made a specific prediction for a single transaction. It’s great for understanding individual decisions.
- PartialDependenceDisplay: to see how a feature, like “amount” or “description,” affects predictions across all transactions. It helps find overall patterns.
When We Use These:
- LIME: when you want to explain or debug the model’s decision for a single case, like why a transaction was labeled “Loans.”
- PartialDependenceDisplay: when you want to understand how a feature impacts predictions across the whole dataset.
What the Outputs Are:
- LIME: shows which features were most important for a single prediction and how they influenced the result (e.g., positively or negatively).
- PartialDependenceDisplay: creates graphs that show how a feature affects predictions overall (average effect) and for individual cases (variability).
LIME
# create feature names
feature_names = [f"Feature {i}" for i in range(X_train_transformed.shape[1])]
# LimeTabularExplainer on X_train_transformed with feature names
explainer = LimeTabularExplainer(
training_data=X_train_transformed,
feature_names=feature_names,
class_names=[str(cls) for cls in np.unique(y_train)],
mode="classification"
)
i = 0
instance = X_test_transformed[i]
# explain the prediction of instance i
explanation = explainer.explain_instance(
data_row=instance,
predict_fn=rf.predict_proba,
num_features=20,
top_labels=1
)
explanation.show_in_notebook(show_table=True)
# save the explanation to a file
explanation.save_to_file('lime_explanation.html')
Partial Dependence Plots (PDPs)
feature_indices = [0, 1, 2]
target_class = 0
fig, ax = plt.subplots(len(feature_indices), 1, figsize=(10, 5 * len(feature_indices)))
for i, feature_idx in enumerate(feature_indices):
PartialDependenceDisplay.from_estimator(
rf,
X_test_transformed,
features=[feature_idx],
target=target_class,
kind="both",
ax=ax[i] if len(feature_indices) > 1 else ax,
line_kw={"color": "blue"},
pd_line_kw={"color": "red", "linewidth": 2},
)
ax[i].set_title(f"PDP and ICE for Feature {feature_idx} (Target Class: {target_class})", fontsize=14)
ax[i].set_ylabel("Predicted Value", fontsize=12)
ax[i].set_xlabel(f"Feature {feature_idx}", fontsize=12)
plt.tight_layout()
plt.show()
Prediction Based on New User Data¶
# load feature engineering pipeline and best model
feature_pipeline = joblib.load("models/feature_engineering_pipeline.pkl")
best_model = joblib.load("models/Random Forest_model.pkl")
new_user_data = pd.DataFrame({
"txn_date": ["2025-01-01"],
"description": ["Payment to ABC Store"],
"amount": [10.5],
"is_interested_investment": [0],
"is_interested_build_credit": [1],
"is_interested_increase_income": [0],
"is_interested_pay_off_debt": [1],
"is_interested_managed_spending": [1],
"is_interested_grow_savings": [0]
})
# apply the feature pipeline to the new user data
new_user_data_transformed = feature_pipeline.transform(new_user_data)
# get the prediction and probabilities
predicted_category = best_model.predict(new_user_data_transformed)
predicted_proba = best_model.predict_proba(new_user_data_transformed)
print("Predicted Category:", predicted_category[0])
print("Prediction Probabilities:", predicted_proba)
Predicted Category: 13
Prediction Probabilities: [[0. 0.01 0. 0.01 0.04 0.01 0.05 0.03 0.01 0.04 0.04 0.01 0. 0.22 0. 0.13 0.01 0.03 0.01 0. 0. 0.16 0.01 0.03 0.07 0.06 0.01 0.01]]
Convert the predicted category code back to its original label
predicted_category = le.inverse_transform(predicted_category)
predicted_category
array(['Loans'], dtype=object)
Steps to Enhance Model Performance¶
- Hyper-Parameter Tuning: optimize the Random Forest model's settings using GridSearchCV or RandomizedSearchCV.
- Address Class Imbalance: use SMOTE or class weighting to better handle imbalanced categories during training.
- Feature Engineering: apply word embeddings to the description column to enrich the text features, and analyze feature importance to craft more impactful ones.
- Ensemble Models: combine models such as Random Forest, XGBoost, and Logistic Regression via stacking or blending to leverage their complementary strengths.
- Deploy the Model: save the best model and set up APIs for real-time predictions.
- Monitor and Iterate: regularly collect new data, retrain the model, and track its performance for continuous improvement over time.
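The first two suggestions can be sketched together: a minimal GridSearchCV run over a class-weighted Random Forest. The data here is synthetic stand-in data; in the notebook, X_train_transformed and y_train would take its place, and the grid would be much larger:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the transformed training data
X_demo, y_demo = make_classification(n_samples=200, n_features=10, n_informative=5,
                                     n_classes=3, random_state=42)

param_grid = {"n_estimators": [50, 100], "max_depth": [None, 10]}
search = GridSearchCV(
    # class_weight='balanced' reweights classes inversely to frequency, one option for imbalance
    RandomForestClassifier(random_state=42, class_weight="balanced"),
    param_grid,
    cv=3,
    scoring="f1_macro",  # macro F1 rewards balanced performance across classes
    n_jobs=-1,
)
search.fit(X_demo, y_demo)
```

After fitting, `search.best_params_` and `search.best_estimator_` give the winning configuration and a refitted model ready for evaluation on the held-out test split.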
Code for Notebook Customization¶
from IPython.core.display import HTML
style = """
<style>
body {
background-color: #f2fff2;
}
h1 {
text-align: center;
font-weight: bold;
font-size: 36px;
color: #4295F4;
text-decoration: underline;
padding-top: 15px;
}
h2 {
text-align: left;
font-weight: bold;
font-size: 30px;
color: #4A000A;
text-decoration: underline;
padding-top: 10px;
}
h3 {
text-align: left;
font-weight: bold;
font-size: 30px;
color: #f0081e;
text-decoration: underline;
padding-top: 5px;
}
p {
text-align: center;
font-size: 12px;
color: #0B9923;
}
</style>
"""
html_content = """
<h1>Hello</h1>
<p>Hello World</p>
<h2> Hello</h2>
<h3> World </h3>
"""
HTML(style + html_content)
Hello
Hello World
Hello
World